Semantic Word Clouds with Background Corpus Normalization and t-distributed Stochastic Neighbor Embedding
Authors
Abstract
Many word clouds provide no semantics in the word placement, using a random layout optimized solely for aesthetic purposes. We propose a novel approach to model word significance and word affinity within a document, and in comparison to a large background corpus. We demonstrate its usefulness for generating more meaningful word clouds as a visual summary of a given document. We then select keywords based on their significance and construct the word cloud based on the derived affinity. Based on a modified t-distributed stochastic neighbor embedding (t-SNE), we generate a semantic word placement. For words that co-occur significantly, we include edges, and we cluster the words according to their co-occurrence. For this we designed a scalable and memory-efficient sketch-based approach, usable on commodity hardware, to aggregate the corpus statistics needed for normalization and for identifying keywords as well as significant co-occurrences. We empirically validate our approach using a large Wikipedia corpus.
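The sketch-based aggregation of background-corpus statistics can be illustrated with a count-min sketch. The code below is only a minimal stand-in (the paper does not prescribe this exact data structure or scoring formula): approximate background counts live in bounded memory, and a smoothed log-ratio of in-document rate to background rate serves as a hypothetical significance score.

```python
import hashlib
import math

class CountMinSketch:
    """Memory-bounded approximate counter: a hypothetical stand-in for the
    paper's sketch-based statistics aggregation."""
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _hashes(self, item):
        # One independent hash position per row of the table.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, idx in enumerate(self._hashes(item)):
            self.table[row][idx] += count

    def estimate(self, item):
        # Collisions can only inflate counts; the row-wise minimum
        # is an upper bound closest to the true count.
        return min(self.table[row][idx]
                   for row, idx in enumerate(self._hashes(item)))

def significance(word, doc_counts, doc_total, bg_sketch, bg_total):
    """Smoothed log-ratio of in-document rate to background rate.
    Positive values mean the word is over-represented in the document."""
    p_doc = (doc_counts.get(word, 0) + 1) / (doc_total + 1)
    p_bg = (bg_sketch.estimate(word) + 1) / (bg_total + 1)
    return math.log(p_doc / p_bg)
```

Under this scoring, a stop word that is frequent both in the document and in the background corpus scores lower than a topic word that is frequent only in the document.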
Similar Resources
Distinguish Polarity in Bag-of-Words Visualization
Neural network-based BOW models reveal that word-embedding vectors encode strong semantic regularities. However, such models are insensitive to word polarity. We show that, coupled with simple information such as word spellings, word-embedding vectors can preserve both semantic regularity and conceptual polarity without supervision. We then describe a nontrivial modification to the t-distributed...
Better Word Embeddings for Korean
Vector representations of words that capture semantic and syntactic information accurately are critical for the performance of models that use these vectors as inputs. Algorithms that only use the surrounding context at the word level ignore subword-level relationships, which carry important meaning, especially for highly inflected languages such as Korean. In this paper we compare th...
Syntactico Semantic Word Representations in Multiple Languages
Our project is an extension of the project "Syntactico Semantic Word Representations in Multiple Languages" [1]. The previous project aims to improve the semantic representation of English vocabulary by incorporating local context with global context, and by handling homonymy and polysemy with multiple embeddings per word. It also introduces a new neural network architecture that learns the w...
Text comparison using word vector representations and dimensionality reduction
This paper describes a technique to compare large text sources using word vector representations (word2vec) and dimensionality reduction (t-SNE), and how it can be implemented in Python. The technique provides a bird's-eye view of text sources, e.g. text summaries and their source material, and enables users to explore text sources like a geographical map. Word vector representations capture m...
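The word2vec-plus-t-SNE pipeline that abstract describes can be sketched with scikit-learn. The vectors below are random toy stand-ins for real pretrained word2vec embeddings (which an actual pipeline would load, e.g. via gensim); the sketch only shows the projection step that turns high-dimensional vectors into 2-d map coordinates.

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy 50-d word vectors: random stand-ins for pretrained word2vec embeddings.
rng = np.random.default_rng(0)
words = ["king", "queen", "apple", "pear", "car", "truck"]
vectors = rng.normal(size=(len(words), 50))

# Project to 2-d for a map-like visualization of the vocabulary.
# Perplexity must be smaller than the number of samples.
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(vectors)

for word, (x, y) in zip(words, coords):
    print(f"{word}: ({x:.1f}, {y:.1f})")
```

With real embeddings, semantically related words (e.g. "king"/"queen") would land near each other on the resulting 2-d map; with the random toy vectors here the layout is arbitrary.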
Temporal Semantic Analysis and Visualization of Words
Today many languages are spoken in the world, among which English is the most popular. However, English words have evolved so much over history that it is very difficult for contemporary readers to read ancient English articles. There are many kinds of change, such as the mutation of a word itself and the migration of word usage from one context to another. It is thus very interesting to unders...
Journal: CoRR
Volume: abs/1708.03569
Year of publication: 2017